AITopics | prosody transfer

Collaborating Authors

prosody transfer

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

CrossVoice: Crosslingual Prosody Preserving Cascade-S2ST using Transfer Learning

Hira, Medha, Goel, Arnav, Gupta, Anubha

arXiv.org Artificial IntelligenceJun-18-2024

This paper presents CrossVoice, a novel cascade-based Speech-to-Speech Translation (S2ST) system employing advanced ASR, MT, and TTS technologies with cross-lingual prosody preservation through transfer learning. We conducted comprehensive experiments comparing CrossVoice with direct-S2ST systems, showing improved BLEU scores on tasks such as Fisher Es-En, VoxPopuli Fr-En and prosody preservation on benchmark datasets CVSS-T and IndicTTS. With an average mean opinion score of 3.6 out of 4, speech synthesized by CrossVoice closely rivals human speech on the benchmark highlighting the efficacy of cascade-based systems and transfer learning in multilingual S2ST with prosody transfer. Transformer-based models (Vaswani et al., 2017) have revolutionized speech processing, leading to significant advancements in automatic speech recognition and text-to-speech technologies (Latif et al., 2023; Prabhavalkar et al., 2023). This shift towards end-to-end systems has opened new avenues in Speech-to-Speech Translation (S2ST) for translating speech across languages.

bleu score, crossvoice, translation, (15 more...)

arXiv.org Artificial Intelligence

2406.00021

Country:

North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Asia > India > NCT > New Delhi (0.04)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Multilingual Prosody Transfer: Comparing Supervised & Transfer Learning

Goel, Arnav, Hira, Medha, Gupta, Anubha

arXiv.org Artificial IntelligenceJun-18-2024

The field of prosody transfer in speech synthesis systems is rapidly advancing. This research is focused on evaluating learning methods for adapting pre-trained monolingual text-to-speech (TTS) models to multilingual conditions, i.e., Supervised Fine-Tuning (SFT) and Transfer Learning (TL). This comparison utilizes three distinct metrics: Mean Opinion Score (MOS), Recognition Accuracy (RA), and Mel Cepstral Distortion (MCD). Results demonstrate that, in comparison to SFT, TL leads to significantly enhanced performance, with an average MOS higher by 1.53 points, a 37.5% increase in RA, and approximately, a 7.8-point improvement in MCD. These findings are instrumental in helping build TTS models for low-resource languages.

iclr 2024, prosody transfer, transfer learning, (11 more...)

arXiv.org Artificial Intelligence

2406.00022

Country:

Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.05)
Asia > India > NCT > New Delhi (0.04)

Genre: Research Report > New Finding (0.34)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

A Human-in-the-Loop Approach to Improving Cross-Text Prosody Transfer

Maurya, Himanshu, Sigurgeirsson, Atli

arXiv.org Artificial IntelligenceJun-6-2024

Text-To-Speech (TTS) prosody transfer models can generate varied prosodic renditions, for the same text, by conditioning on a reference utterance. These models are trained with a reference that is identical to the target utterance. But when the reference utterance differs from the target text, as in cross-text prosody transfer, these models struggle to separate prosody from text, resulting in reduced perceived naturalness. To address this, we propose a Human-in-the-Loop (HitL) approach. HitL users adjust salient correlates of prosody to make the prosody more appropriate for the target text, while maintaining the overall reference prosodic effect. Human adjusted renditions maintain the reference prosody while being rated as more appropriate for the target text $57.8\%$ of the time. Our analysis suggests that limited user effort suffices for these improvements, and that closeness in the latent reference space is not a reliable prosodic similarity metric for the cross-text condition.

participant, prosody, target text, (15 more...)

arXiv.org Artificial Intelligence

2406.06601

Country:

North America > Canada > Quebec > Montreal (0.04)
North America > United States (0.04)
Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
(2 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.36)

Add feedback

Towards cross-language prosody transfer for dialog

Avila, Jonathan E., Ward, Nigel G.

arXiv.org Artificial IntelligenceJul-9-2023

Speech-to-speech translation systems today do not adequately support use for dialog purposes. In particular, nuances of speaker intent and stance can be lost due to improper prosody transfer. We present an exploration of what needs to be done to overcome this. First, we developed a data collection protocol in which bilingual speakers re-enact utterances from an earlier conversation in their other language, and used this to collect an English-Spanish corpus, so far comprising 1871 matched utterance pairs. Second, we developed a simple prosodic dissimilarity metric based on Euclidean distance over a broad set of prosodic features. We then used these to investigate cross-language prosodic differences, measure the likely utility of three simple baseline models, and identify phenomena which will require more powerful modeling. Our findings should inform future research on cross-language prosody and the design of speech-to-speech translation systems capable of effective prosody transfer.

artificial intelligence, natural language, utterance, (17 more...)

arXiv.org Artificial Intelligence

2307.04123

Country:

North America > United States > Texas > El Paso County > El Paso (0.04)
North America > Mexico (0.04)
Asia > Japan > Honshū > Tōhoku > Iwate Prefecture > Morioka (0.04)

Genre: Research Report > New Finding (0.48)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Do Prosody Transfer Models Transfer Prosody?

Sigurgeirsson, Atli Thor, King, Simon

arXiv.org Artificial IntelligenceMar-7-2023

Some recent models for Text-to-Speech synthesis aim to transfer the prosody of a reference utterance to the generated target synthetic speech. This is done by using a learned embedding of the reference utterance, which is used to condition speech generation. During training, the reference utterance is identical to the target utterance. Yet, during synthesis, these models are often used to transfer prosody from a reference that differs from the text or speaker being synthesized. To address this inconsistency, we propose to use a different, but prosodically-related, utterance during training too. We believe this should encourage the model to learn to transfer only those characteristics that the reference and target have in common. If prosody transfer methods do indeed transfer prosody they should be able to be trained in the way we propose. However, results show that a model trained under these conditions performs significantly worse than one trained using the target utterance as a reference. To explain this, we hypothesize that prosody transfer models do not learn a transferable representation of prosody, but rather an utterance-level representation which is highly dependent on both the reference speaker and reference text.

artificial intelligence, machine learning, prosody, (17 more...)

arXiv.org Artificial Intelligence

2303.04289

Country:

North America > Canada > Quebec > Montreal (0.04)
Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)

Genre: Research Report > New Finding (0.49)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (0.57)

Add feedback

Analysis and Assessment of Controllability of an Expressive Deep Learning-Based TTS System

#artificialintelligenceFeb-26-2022, 05:10:43 GMT

In this paper, we study the controllability of an Expressive TTS system trained on a dataset for a continuous control. The dataset is the Blizzard 2013 dataset based on audiobooks read by a female speaker containing a great variability in styles and expressiveness. Controllability is evaluated with both an objective and a subjective experiment. The objective assessment is based on a measure of correlation between acoustic features and the dimensions of the latent space representing expressiveness. The subjective assessment is based on a perceptual experiment in which users are shown an interface for Controllable Expressive TTS and asked to retrieve a synthetic utterance whose expressiveness subjectively corresponds to that a reference utterance.

representation, speech synthesis, synthesis, (12 more...)

#artificialintelligence

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.88)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.32)

Add feedback

Neural Text-to-Speech Makes Speech Synthesizers Much More Versatile : Alexa Blogs

#artificialintelligenceSep-11-2019, 16:03:47 GMT

A text-to-speech system, which converts written text into synthesized speech, is what allows Alexa to respond verbally to requests or commands. Through a service called Amazon Polly, text-to-speech is also a technology that Amazon Web Services offers to its customers. Last year, both Alexa and Polly evolved toward neural-network-based text-to-speech systems, which synthesize speech from scratch, rather than the earlier unit-selection method, which strung together tiny snippets of pre-recorded sounds. In user studies, people tend to find speech produced by neural text-to-speech (NTTS) systems more natural-sounding than speech produced by unit selection. But the real advantage of NTTS is its adaptability, something we demonstrated last year in our work on changing the speaking style ("newscaster" versus "neutral") of an NTTS system.

artificial intelligence, optical character recognition, vocoder, (15 more...)

#artificialintelligence

Industry: Information Technology > Services (0.99)

Technology:

Information Technology > Artificial Intelligence > Vision > Optical Character Recognition (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Synthesis (1.00)

Add feedback